AITopics | preference score

Collaborating Authors

preference score

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Preference Completion from Partial Rankings

Suriya Gunasekar, Oluwasanmi O. Koyejo, Joydeep Ghosh

Neural Information Processing SystemsMar-23-2026, 15:22:38 GMT

We propose a novel and efficient algorithm for the collaborative preference completion problem, which involves jointly estimating individualized rankings for a set of entities over a shared set of items, based on a limited number of observed affinity values. Our approach exploits the observation that while preferences are often recorded as numerical scores, the predictive quantity of interest is the underlying rankings. Thus, attempts to closely match the recorded scores may lead to overfitting and impair generalization performance. Instead, we propose an estimator that directly fits the underlying preference order, combined with nuclear norm constraints to encourage low-rank parameters. Besides (approximate) correctness of the ranking order, the proposed estimator makes no generative assumption on the numerical scores of the observations. One consequence is that the proposed estimator can fit any consistent partial ranking over a subset of the items represented as a directed acyclic graph (DAG), generalizing standard techniques that can only fit preference scores. Despite this generality, for supervision representing total or blockwise total orders, the computational complexity of our algorithm is within a log factor of the standard algorithms for nuclear norm regularization based estimates for matrix completion.

algorithm, artificial intelligence, machine learning, (16 more...)

Neural Information Processing Systems

Country:

North America > United States > Illinois (0.28)
North America > United States > Texas > Travis County > Austin (0.14)

Industry: Health & Medicine > Therapeutic Area > Neurology (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Personal Assistant Systems (0.47)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.46)

Add feedback

Aligning Language Models with Human Preferences via a Bayesian Approach Jiashuo W ANG 1, Haozhao W ANG

Neural Information Processing SystemsFeb-16-2026, 02:16:53 GMT

"Rule-of-Thumb" generation, show that our method consistently exceeds previous

artificial intelligence, machine learning, natural language, (19 more...)

Neural Information Processing Systems

Country: Asia > China > Hong Kong (0.04)

Genre: Research Report > New Finding (0.93)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.82)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.64)

Add feedback

Stackelberg Learning from Human Feedback: Preference Optimization as a Sequential Game

Pásztor, Barna, Buening, Thomas Kleine, Krause, Andreas

arXiv.org Machine LearningDec-19-2025

We introduce Stackelberg Learning from Human Feedback (SLHF), a new framework for preference optimization. SLHF frames the alignment problem as a sequential-move game between two policies: a Leader, which commits to an action, and a Follower, which responds conditionally on the Leader's action. This approach decomposes preference optimization into a refinement problem for the Follower and an optimization problem against an adversary for the Leader. Unlike Reinforcement Learning from Human Feedback (RLHF), which assigns scalar rewards to actions, or Nash Learning from Human Feedback (NLHF), which seeks a simultaneous-move equilibrium, SLHF leverages the asymmetry of sequential play to capture richer preference structures. The sequential design of SLHF naturally enables inference-time refinement, as the Follower learns to improve the Leader's actions, and these refinements can be leveraged through iterative sampling. We compare the solution concepts of SLHF, RLHF, and NLHF, and lay out key advantages in consistency, data sensitivity, and robustness to intransitive preferences. Experiments on large language models demonstrate that SLHF achieves strong alignment across diverse preference datasets, scales from 0.5B to 8B parameters, and yields inference-time refinements that transfer across model families without further fine-tuning.

learning, tackelberg gda, wen 2, (13 more...)

arXiv.org Machine Learning

2512.16626

Country:

Europe > Switzerland > Zürich > Zürich (0.04)
Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Dual-Weighted Reinforcement Learning for Generative Preference Modeling

Feng, Shengyu, He, Yun, Ma, Shuang, Li, Beibin, Xiong, Yuanhao, Li, Songlin, Mandyam, Karishma, Katz-Samuels, Julian, Bi, Shengjie, Yu, Licheng, Zhang, Hejia, Sankararaman, Karthik Abinav, Fang, Han, Mansour, Riham, Yang, Yiming, Faruqui, Manaal

arXiv.org Artificial IntelligenceOct-23-2025

Reinforcement learning (RL) has recently proven effective at scaling chain-of-thought (CoT) reasoning in large language models on tasks with verifiable answers. However, extending RL to more general non-verifiable tasks, typically in the format of human preference pairs, remains both challenging and underexplored. In this work, we propose Dual-Weighted Reinforcement Learning (DWRL), a new framework for preference modeling that integrates CoT reasoning with the Bradley-Terry (BT) model via a dual-weighted RL objective that preserves preference-modeling inductive bias. DWRL approximates the maximum-likelihood objective of the BT model with two complementary weights: an instance-wise misalignment weight, which emphasizes under-trained pairs misaligned with human preference, and a group-wise (self-normalized) conditional preference score, which promotes promising thoughts. In this paper, we apply DWRL to preference modeling by training generative preference models (GPMs) to first generate a thought and then predict the human preference score. Across multiple benchmarks and model scales (Llama3 and Qwen2.5), DWRL consistently outperforms both GPM baselines and scalar models, while producing coherent, interpretable thoughts. In summary, our results position DWRL as a general framework for reasoning-enhanced preference learning beyond verifiable tasks.

machine learning, natural language, reinforcement learning, (18 more...)

arXiv.org Artificial Intelligence

2510.15242

Genre: Research Report > New Finding (0.34)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
(2 more...)

Add feedback

370bfb31abd222b582245b977ea5f25a-AuthorFeedback.pdf

Neural Information Processing SystemsOct-2-2025, 12:57:47 GMT

adain, artificial intelligence, example image, (14 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence (0.30)

Add feedback

Free Lunch Alignment of Text-to-Image Diffusion Models without Preference Image Pairs

Xian, Jia Jun Cheng, Li, Muchen, Yang, Haotian, Tao, Xin, Wan, Pengfei, Sigal, Leonid, Liao, Renjie

arXiv.org Artificial IntelligenceOct-1-2025

Recent advances in diffusion-based text-to-image (T2I) models have led to remarkable success in generating high-quality images from textual prompts. However, ensuring accurate alignment between the text and the generated image remains a significant challenge for state-of-the-art diffusion models. To address this, existing studies employ reinforcement learning with human feedback (RLHF) to align T2I outputs with human preferences. These methods, however, either rely directly on paired image preference data or require a learned reward function, both of which depend heavily on costly, high-quality human annotations and thus face scalability limitations. In this work, we introduce Text Preference Optimization (TPO), a framework that enables "free-lunch" alignment of T2I models, achieving alignment without the need for paired image preference data. TPO works by training the model to prefer matched prompts over mismatched prompts, which are constructed by perturbing original captions using a large language model. Our framework is general and compatible with existing preference-based algorithms. We extend both DPO and KTO to our setting, resulting in TDPO and TKTO. Quantitative and qualitative evaluations across multiple benchmarks show that our methods consistently outperform their original counterparts, delivering better human preference scores and improved text-to-image alignment. Our Open-source code is available at https://github.com/DSL-Lab/T2I-Free-Lunch-Alignment.

artificial intelligence, deep learning, machine learning, (18 more...)

arXiv.org Artificial Intelligence

2509.25771

Country: North America > Canada (0.46)

Genre: Research Report (1.00)

Industry: Government (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Alignment through Meta-Weighted Online Sampling: Bridging the Gap between Data Generation and Preference Optimization

Yang, Junming, Xu, Ning, Liu, Biao, Qiao, Shiqi, Geng, Xin

arXiv.org Artificial IntelligenceSep-30-2025

Preference optimization is crucial for aligning large language models (LLMs) with human values and intentions. A significant challenge in this process is the distribution mismatch between pre-collected offline preference data and the evolving model policy. Existing methods attempt to reduce this gap using static heuristics or decoupled online sampling strategies, but they often fail to adapt to the model's dynamic learning state. To bridge this gap, we propose Meta-Weighted Adaptive Preference Optimization (MetaAPO), a novel framework that dynamically couples data generation with model training. MetaAPO employs a lightweight meta-learner, as an "alignment gap estimator", to evaluate the potential benefits of on-policy sampling in relation to offline data. This guides targeted online generation and assigns sample-wise meta-weights to the optimization objective, dynamically balancing the quality and distribution of online and offline data. Experiments on AlpacaEval 2, Arena-Hard and MT-Bench demonstrate that MetaAPO consistently outperforms existing preference optimization approaches across various settings, while reducing 42% in online annotation costs.

arxiv preprint arxiv, large language model, machine learning, (16 more...)

arXiv.org Artificial Intelligence

2509.23371

Country: North America > United States (1.00)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.89)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)

Add feedback

GFRIEND: Generative Few-shot Reward Inference through EfficieNt DPO

Zhao, Yiyang, Bai, Huiyu, Zhao, Xuejiao

arXiv.org Artificial IntelligenceJun-11-2025

The ability to train high-performing reward models with few-shot data is critical for enhancing the efficiency and scalability of Reinforcement Learning from Human Feedback (RLHF). We propose a data augmentation and expansion framework that enables generative reward models trained on small datasets to achieve comparable performance to those trained on large-scale datasets. Traditional methods to train a generative reward model, such as Direct Preference Optimization (DPO), are constrained by inefficiencies in sample pairing and limited data diversity. This work introduces preference refinement, which employs Chain-of-Thought (CoT) sampling to uncover diverse and high-quality preference relationships. It also incorporates a perplexity-based scoring mechanism to assign nuanced preference levels and utilizes Multi-level Direct Preference Optimization (M-DPO) to enable the model to capture finer-grained preference differences between samples. Experimental results demonstrate that the proposed method significantly enhances data efficiency and model performance, enabling reward models trained in a few-shot setting to achieve results on par with those trained on large-scale datasets. This study underscores the potential of data-efficient strategies in advancing reward model optimization, offering a robust solution for low-resource RLHF applications.

arxiv preprint arxiv, large language model, machine learning, (17 more...)

arXiv.org Artificial Intelligence

2506.08965

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Mapping Geopolitical Bias in 11 Large Language Models: A Bilingual, Dual-Framing Analysis of U.S.-China Tensions

Guey, William, Bougault, Pierrick, de Moura, Vitor D., Zhang, Wei, Gomes, Jose O.

arXiv.org Artificial IntelligenceMar-30-2025

This study systematically analyzes geopolitical bias across 11 prominent Large Language Models (LLMs) by examining their responses to seven critical topics in U.S.-China relations. Utilizing a bilingual (English and Chinese) and dual-framing (affirmative and reverse) methodology, we generated 19,712 prompts designed to detect ideological leanings in model outputs. Responses were quantitatively assessed on a normalized scale from -2 (strongly Pro-China) to +2 (strongly Pro-U.S.) and categorized according to stance, neutrality, and refusal rates. The findings demonstrate significant and consistent ideological alignments correlated with the LLMs' geographic origins; U.S.-based models predominantly favored Pro-U.S. stances, while Chinese-origin models exhibited pronounced Pro-China biases. Notably, language and prompt framing substantially influenced model responses, with several LLMs exhibiting stance reversals based on prompt polarity or linguistic context. Additionally, we introduced comprehensive metrics to evaluate response consistency across languages and framing conditions, identifying variability and vulnerabilities in model behaviors. These results offer practical insights that can guide organizations and individuals in selecting LLMs best aligned with their operational priorities and geopolitical considerations, underscoring the importance of careful model evaluation in politically sensitive applications. Furthermore, the research highlights specific prompt structures and linguistic variations that can strategically trigger distinct responses from models, revealing methods for effectively navigating and influencing LLM outputs.

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2503.23688

Country:

Asia > Taiwan (0.06)
Pacific Ocean > North Pacific Ocean > South China Sea (0.05)
Asia > China > Beijing > Beijing (0.04)
(5 more...)

Genre: Research Report > New Finding (0.48)

Industry: Government > Regional Government > North America Government > United States Government (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.49)

Add feedback

Filters

Collaborating Authors

preference score

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

Preference Completion from Partial Rankings

Aligning Language Models with Human Preferences via a Bayesian Approach Jiashuo W ANG 1, Haozhao W ANG

Stackelberg Learning from Human Feedback: Preference Optimization as a Sequential Game

Dual-Weighted Reinforcement Learning for Generative Preference Modeling

99b419554537c66bf27e5eb7a74c7de4-Paper-Conference.pdf

370bfb31abd222b582245b977ea5f25a-AuthorFeedback.pdf

Free Lunch Alignment of Text-to-Image Diffusion Models without Preference Image Pairs

Alignment through Meta-Weighted Online Sampling: Bridging the Gap between Data Generation and Preference Optimization

GFRIEND: Generative Few-shot Reward Inference through EfficieNt DPO

Mapping Geopolitical Bias in 11 Large Language Models: A Bilingual, Dual-Framing Analysis of U.S.-China Tensions